AITopics | multimodal latent space

Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders

Neural Information Processing Systems, Dec-24-2025, 04:38:16 GMT

Discrete latent spaces in variational autoencoders have been shown to effectively capture the data distribution for many real-world problems such as natural language understanding, human intent prediction, and visual scene representation. However, discrete latent spaces need to be sufficiently large to capture the complexities of real-world data, rendering downstream tasks computationally challenging. For instance, performing motion planning in a high-dimensional latent representation of the environment could be intractable. We consider the problem of sparsifying the discrete latent space of a trained conditional variational autoencoder, while preserving its learned multimodality. As a post hoc latent space reduction technique, we use evidential theory to identify the latent classes that receive direct evidence from a particular input condition and filter out those that do not. Experiments on diverse tasks, such as image generation and human behavior prediction, demonstrate the effectiveness of our proposed technique at reducing the discrete latent sample space size of a model while maintaining its learned multimodality.
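The filter-and-renormalize idea in the abstract above can be illustrated with a minimal sketch. It assumes a per-class evidence score has already been computed for one input condition; the function name `sparsify_latent_distribution`, the example `evidence` values, and the zero threshold are hypothetical and are not taken from the paper, whose evidential-theory computation is not reproduced here.

```python
from typing import List, Sequence


def sparsify_latent_distribution(
    probs: Sequence[float],
    evidence: Sequence[float],
    threshold: float = 0.0,
) -> List[float]:
    """Zero out latent classes without evidence and renormalize the rest.

    `evidence` is a hypothetical per-class score for one input condition;
    how it is derived (the evidential-theory step) is outside this sketch.
    """
    if len(probs) != len(evidence):
        raise ValueError("probs and evidence must have the same length")

    # Keep probability mass only for classes whose evidence exceeds the threshold.
    kept = [p if e > threshold else 0.0 for p, e in zip(probs, evidence)]

    total = sum(kept)
    if total == 0.0:
        # No class passed the filter: fall back to the original distribution.
        return list(probs)

    # Renormalize so the reduced distribution sums to one.
    return [p / total for p in kept]


# Illustrative values only: four latent classes, two of which receive evidence.
probs = [0.4, 0.1, 0.3, 0.2]
evidence = [1.5, 0.0, 0.7, 0.0]
sparse = sparsify_latent_distribution(probs, evidence)
```

In this toy setting the two classes without evidence are dropped and the remaining mass is rescaled, so the relative weighting of the surviving modes is unchanged.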

conditional variational autoencoder, evidential sparsification, multimodal latent space, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language (0.98)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.88)


Investigating the Invertibility of Multimodal Latent Spaces: Limitations of Optimization-Based Methods

Park, Siwoo

arXiv.org Artificial Intelligence, Aug-1-2025

This paper investigates the inverse capabilities and broader utility of multimodal latent spaces within task-specific AI (Artificial Intelligence) models. While these models excel at their designed forward tasks (e.g., text-to-image generation, audio-to-text transcription), their potential for inverse mappings remains largely unexplored. We propose an optimization-based framework to infer input characteristics from desired outputs, applying it bidirectionally across Text-Image (BLIP, Flux.1-dev) and Text-Audio (Whisper-Large-V3, Chatterbox-TTS) modalities. Our central hypothesis posits that while optimization can guide models towards inverse tasks, their multimodal latent spaces will not consistently support semantically meaningful and perceptually coherent inverse mappings. Experimental results consistently validate this hypothesis. We demonstrate that while optimization can force models to produce outputs that align textually with targets (e.g., a text-to-image model generating an image that an image captioning model describes correctly, or an ASR model transcribing optimized audio accurately), the perceptual quality of these inversions is chaotic and incoherent. Furthermore, when attempting to infer the original semantic input from generative models, the reconstructed latent space embeddings frequently lack semantic interpretability, aligning with nonsensical vocabulary tokens. These findings highlight a critical limitation: multimodal latent spaces, primarily optimized for specific forward tasks, do not inherently possess the structure required for robust and interpretable inverse mappings. Our work underscores the need for further research into developing truly semantically rich and invertible multimodal latent spaces.

artificial intelligence, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2507.2301

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.94)


Review for NeurIPS paper: Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders

Neural Information Processing Systems, Jan-25-2025, 19:32:20 GMT

Weaknesses: I find three points of weakness that decrease the potential impact of the work: i) References are too focused on "application" papers and evidential theory, while the authors want to present a new methodology for reducing the discrete latent space dimensionality in auto-encoders. If the authors included more references or comments about theoretical papers on VAEs, this work could be better contrasted with other similar works, which would potentially facilitate its dissemination. ii) Apart from the references, the authors do not include a short paragraph or subsection about the CVAE with a few details to refresh the ideas and make the work totally self-contained. They could have sacrificed half a page of experiments to describe the conditional auto-encoder better. So, if the number 9 was badly compressed in the latent space, and then so many other dimensions were removed, after re-normalising, the number 9 gains importance? Is that what is happening? The other question is about Table 1 and the classification accuracy under 50%, which is pretty bad, right?

conditional variational autoencoder, evidential sparsification, multimodal latent space, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.97)


Review for NeurIPS paper: Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders

Neural Information Processing Systems, Jan-25-2025, 19:32:12 GMT

All the reviewers were relatively positive about this paper but there were some concerns, mainly with regards to details of the experimental comparison and related work. These have been clarified during the rebuttal and the reviewers were happy to recommend acceptance.

conditional variational autoencoder, evidential sparsification, multimodal latent space, (2 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.40)


tinyCLAP: Distilling Constrastive Language-Audio Pretrained Models

Paissan, Francesco, Farella, Elisabetta

arXiv.org Artificial Intelligence, Nov-24-2023

Contrastive Language-Audio Pretraining (CLAP) has become crucially important in the field of audio and speech processing. Its applications range from sound event detection to text-to-audio generation. However, one of its main limitations is the considerable amount of data required in the training process and the overall computational complexity during inference. This paper investigates how we can reduce the complexity of contrastive language-audio pre-trained models, yielding an efficient model that we call tinyCLAP. We derive a unimodal distillation loss from first principles and explore how the dimensionality of the shared, multimodal latent space can be reduced via pruning. TinyCLAP uses only 6% of the original Microsoft CLAP parameters with a minimal reduction (less than 5%) in zero-shot classification performance across the three sound event detection datasets on which it was tested.

distillation process, encoder, latent space, (15 more...)

arXiv.org Artificial Intelligence

2311.14517

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.91)


Effect of The Latent Structure on Clustering with GANs

Mishra, Deepak, Jayendran, Aravind, P, Prathosh A.

arXiv.org Machine Learning, May-5-2020

Generative adversarial networks (GANs) have shown remarkable success in generating data from natural data manifolds such as images. In several scenarios, it is desirable that the generated data is well-clustered, especially when there is severe class imbalance. In this paper, we focus on the problem of clustering in the generated space of GANs and uncover its relationship with the characteristics of the latent space. We derive, from first principles, the necessary and sufficient conditions needed to achieve faithful clustering in the GAN framework: (i) presence of a multimodal latent space with adjustable priors, (ii) existence of a latent space inversion mechanism and (iii) imposition of the desired cluster priors on the latent space. We also identify the GAN models in the literature that partially satisfy these conditions and demonstrate the importance of all the components required, through ablative studies on multiple real-world image datasets. Additionally, we describe a procedure to construct a multimodal latent space which facilitates learning of cluster priors with sparse supervision.
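Condition (i) above, a multimodal latent space with adjustable priors, can be pictured with a small sketch. It assumes the latent prior is a mixture of isotropic Gaussians, which is one possible construction and not necessarily the one used in the paper; the function name `sample_multimodal_latent`, the mode means, and the prior weights are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def sample_multimodal_latent(
    priors: Sequence[float],
    means: Sequence[Sequence[float]],
    std: float,
    rng: random.Random,
) -> Tuple[int, List[float]]:
    """Draw one latent vector from a Gaussian mixture with the given mode priors."""
    if len(priors) != len(means):
        raise ValueError("priors and means must have the same length")

    # Pick a mode (cluster) according to the adjustable prior weights.
    mode = rng.choices(range(len(priors)), weights=priors, k=1)[0]

    # Sample around that mode's mean with isotropic Gaussian noise.
    z = [rng.gauss(mu, std) for mu in means[mode]]
    return mode, z


# Illustrative, imbalanced two-cluster setup (assumed values).
rng = random.Random(0)
priors = [0.9, 0.1]
means = [[-5.0, 0.0], [5.0, 0.0]]
samples = [sample_multimodal_latent(priors, means, 0.5, rng) for _ in range(1000)]
counts = [sum(1 for mode, _ in samples if mode == k) for k in range(len(priors))]
```

Changing `priors` shifts how often each mode is sampled without touching the mode locations, which is the sense in which the priors are adjustable in this sketch.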

artificial intelligence, latent space, machine learning, (12 more...)

arXiv.org Machine Learning

doi: 10.1109/LSP.2020.2996935

2005.02435

Country: Asia > India > NCT > Delhi (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.49)
